Abstract

The standard approach to rank the performance of several classifiers for a given classification problem 
is via an independent labeled validation dataset. However, in various applications only unlabeled data 
and several pre-constructed classifiers are provided, without access to labeled training or validation data. 
This raises the following questions: given only the predictions of several classifiers over a large set of
unlabeled test data, is it possible to a) reliably rank their expected performances? and b) construct a 
meta-classifier more accurate than any individual classifier in the ensemble? 

Here we present a spectral approach to address these questions. First, assuming errors of different 
classifiers are statistically independent, we show that the off-diagonal terms of their covariance matrix 
correspond to a rank-one matrix. Moreover, the entries of its leading eigenvector are proportional to 
the (balanced) accuracies of the classifiers. Second, using this eigenvector and without labeled data, we 
construct a novel spectral meta-learner (SML), which is a weighted linear combination of the classifiers in 
the ensemble. We interpret our SML as an approximation of the maximum likelihood estimator (MLE). 
Not only does SML typically achieve a higher accuracy than most classifiers in the ensemble, it also 
provides a better starting point for iterative estimation of the MLE than majority voting. Further, we 
show that SML is robust to the presence of small malicious groups of classifiers designed to veer the 
ensemble prediction away from the (unknown) ground truth. We demonstrate our unsupervised methods 
on several simulated and real datasets. 

Introduction 

Imagine the following student's dilemma: a student is taking an exam, unprepared. However, during 
the test, the student gains access to the answers of fellow classmates. Expectedly, there is some
disagreement between their answers. How should the student proceed to identify who, among the classmates, will
get the highest grade? Is it possible for the student to cleverly combine the answers of his/her classmates 
and pass the exam with a grade better than all of them? 

The first question above corresponds to the problem of estimating prediction performances of
pre-constructed classifiers (e.g. fellow classmates) in the absence of class labels. Namely, each classifier was
constructed independently on a potentially different training dataset (e.g. each classmate studied on
his/her own) and they are all being applied to a new test dataset $D$ (e.g. the exam) for which labels are
not available. In addition, the performance of each classifier on its own training data is unknown. This
setting is markedly different from the typical supervised machine learning setting. There, classifiers are 
ranked after the class labels on the test dataset are disclosed in order to evaluate prediction performances. 
In the student's dilemma, classifiers are ranked based on an estimate of their prediction performance, 
inferred without any access to the class labels. 

The second question may be addressed by a majority voting approach, which was used even in ancient
times [1]. More recently this question was formulated as an iterative likelihood maximization procedure,
as exemplified by Dawid and Skene [2]. We note that if we had external knowledge or historical data
to weigh the contribution of classifiers we could use other well-established approaches such as panels
of experts [3, 4], or forecast combinations [5]; however, this knowledge is not available in the student's
dilemma and thus these solutions cannot be used to address our problem.

In recent years iterative likelihood maximization solutions were successfully applied to crowdsourcing
problems, where multiple annotators with unknown degrees of expertise are requested to provide annotations
of instances [6-13]. The focus of crowdsourcing is however different, since beyond the problem of
inferring annotators' accuracies, a major challenge is how to optimally decide on the number of annotators
and how to assign instances to them. These problems do not arise in the student's dilemma setting,
where we assume that predicted labels can be obtained for all test data at virtually no cost from either
human evaluators or algorithms/machine learning programs. Hence, our student's dilemma setting can
be seen as the full-data crowdsourcing case where all annotators provided predictions for all instances in
the dataset.

In this paper we present four major contributions: 

1. Under standard independence assumptions between classifier errors, we prove that in the limit of 
an infinite test set, the off-diagonal entries of the population covariance matrix of all classifiers 
correspond to a rank-one matrix. 

2. We show that the entries of the first eigenvector of this rank-one matrix are proportional to the 
balanced accuracies of the classifiers. Thus, a spectral decomposition of this rank-one matrix 
provides a fast approach to sort the performances of an ensemble of classifiers. To the best of our 
knowledge, this gives the first computationally efficient and asymptotically consistent solution to 
the classical problem posed by Dawid and Skene [2] in 1979, for which thus far only non-convex 
iterative likelihood maximization solutions have been proposed [8, 14, 17].



3. We propose the Spectral Meta-Learner (SML): a new, easy-to-construct unsupervised ensemble-learner.
Not only does SML typically perform better than most classifiers in the ensemble or their
majority vote, it is also a better starting point for maximum likelihood estimation (MLE) using the
expectation-maximization (EM) algorithm.

4. We show that SML is robust to the presence of conspiring classifiers (representing a cartel or an
interest group), which maliciously attempt to veer the overall ensemble solution away from the
(unknown) ground truth.



1 Problem setup 

Let $\{f_i\}_{i=1}^{M}$ be $M$ binary classifiers, whose inputs belong to some instance space $\mathcal{X}$ (typically $\mathcal{X} = \mathbb{R}^d$).
We assume that each classifier $f_i$ was trained individually in a manner undisclosed to us using its own
labeled training set, which is also unavailable to us. Thus, we view each classifier as a black-box function
$f_i : \mathcal{X} \to \{-1, 1\}$ with an unknown classification performance.

Let $D = \{x_k\}_{k=1}^{S} \subset \mathcal{X}$ be a test set of $S$ unlabeled samples, $y = (y_1, \dots, y_S)$ be their true (unknown)
labels, and $f_i(x_k)$ be the label predicted by the $i$-th classifier at $x_k$.

Given only the predictions of the $M$ classifiers on the unlabeled set $D$ and no other labeled data, we
consider the following two problems: i) rank the performances of the $M$ classifiers; and ii) construct an
improved estimate $\hat{y} = (\hat{y}_1, \dots, \hat{y}_S)$ of the label vector $y$.



2 Ranking of classifiers 

We first introduce some notation and state our assumptions. Let $(X, Y) \in \mathcal{X} \times \{-1, 1\}$ be a random
vector corresponding to our binary classification problem, $p(x, y)$ its probability density function, and
$p(x)$ the marginal of $X$.

In the present study, we measure the performance of a binary classifier $f$ by its balanced accuracy
$\pi$, defined as

$$\pi = \frac{\text{sensitivity} + \text{specificity}}{2} = \frac{1}{2}(\psi + \eta), \tag{1}$$

where $\psi$ and $\eta$ are the sensitivity and specificity, respectively, of the classifier $f$,

$$\psi = \Pr[f(X) = Y \mid Y = 1], \quad \text{and} \quad \eta = \Pr[f(X) = Y \mid Y = -1]. \tag{2}$$

Balanced accuracy is a common measure of quality of a classifier, in particular when one class label is 
much more abundant than the other. As discussed below, in our setting balanced accuracy arises as the 
natural measure to consider. 
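As a small illustration of definitions (1) and (2), the following sketch computes the balanced accuracy of a classifier when labels happen to be available. The function name `balanced_accuracy` and the toy data are illustrative assumptions, not part of the original study.

```python
def balanced_accuracy(predictions, labels):
    """Return (sensitivity + specificity) / 2 for labels in {-1, +1}."""
    pos = [p for p, y in zip(predictions, labels) if y == 1]
    neg = [p for p, y in zip(predictions, labels) if y == -1]
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)    # psi in (2)
    specificity = sum(1 for p in neg if p == -1) / len(neg)   # eta in (2)
    return 0.5 * (sensitivity + specificity)


# Imbalanced toy data (made-up values): 2 positives, 8 negatives.
labels = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
always_negative = [-1] * 10

# Plain accuracy of this trivial classifier is 0.8, balanced accuracy is 0.5.
print(balanced_accuracy(always_negative, labels))  # 0.5
```

The example shows why balanced accuracy is preferred under class imbalance: a classifier that ignores its input scores no better than chance.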

In our analysis we make the following two assumptions: i) the $S$ unlabeled samples $x_k \in D$ are i.i.d.
realizations from the marginal distribution $p(x)$; and ii) the $M$ classifiers are statistically independent, in
the sense that prediction errors made by one classifier are independent of those made by any of the other
classifiers. Namely, for all $1 \le i \ne j \le M$, and for each of the two class labels, with $a_i, a_j \in \{-1, 1\}$,

$$\Pr[f_i(X) = a_i, f_j(X) = a_j \mid Y] = \Pr[f_i(X) = a_i \mid Y] \, \Pr[f_j(X) = a_j \mid Y]. \tag{3}$$

Note that these assumptions are standard both in the development of supervised ensemble methods [18],
as well as in other works considering a setting similar to ours [2, 13].



To understand how we may rank the classifiers without labeled data, it is instructive to consider the
population setting, whereby the number of unlabeled test data tends to infinity, $|D| = S \to \infty$. Let $Q$
be the $M \times M$ population covariance matrix of the $M$ classifiers, with entries

$$q_{ij} = E[(f_i(X) - \mu_i)(f_j(X) - \mu_j)], \tag{4}$$

where $E$ denotes expectation with respect to the density $p(x, y)$ and $\mu_i = E[f_i(X)]$.

The following lemma, proved in the supplementary information, characterizes the relation between 
the matrix Q and the balanced accuracies of the M classifiers: 

Lemma 2.1. The entries $q_{ij}$ of $Q$ are given by

$$q_{ij} = \begin{cases} 1 - \mu_i^2 & i = j \\ (2\pi_i - 1)(2\pi_j - 1)(1 - b^2) & \text{otherwise} \end{cases} \tag{5}$$

where $b \in (-1, 1)$ is the class imbalance,

$$b = \Pr[Y = 1] - \Pr[Y = -1]. \tag{6}$$

The key insight from this lemma is that the off-diagonal entries of $Q$ are identical to those of a
rank-one matrix $R = \lambda v v^T$ with unit-norm eigenvector $v$ and eigenvalue

$$\lambda = (1 - b^2) \cdot \sum_{i=1}^{M} (2\pi_i - 1)^2. \tag{7}$$

Importantly, up to a sign ambiguity, the entries of $v$ are proportional to the balanced accuracies,

$$v_i \propto (2\pi_i - 1). \tag{8}$$






Hence, the M classifiers can be ranked according to their balanced accuracies by sorting the entries of 
the eigenvector v. 
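The off-diagonal formula of Lemma 2.1 can be checked numerically at the population level. The sketch below computes the exact covariance of conditionally independent classifiers directly from their sensitivities, specificities and the class imbalance, and compares it with $(2\pi_i - 1)(2\pi_j - 1)(1 - b^2)$. The helper name `population_covariance` and all numerical values are illustrative assumptions.

```python
def population_covariance(psi, eta, b):
    """Exact covariance of conditionally independent +/-1 classifiers."""
    p = (1 + b) / 2                      # Pr[Y = +1], since b = Pr[Y=1] - Pr[Y=-1]
    m = len(psi)
    pos = [2 * s - 1 for s in psi]       # E[f_i | Y = +1]
    neg = [1 - 2 * e for e in eta]       # E[f_i | Y = -1]
    mu = [p * pos[i] + (1 - p) * neg[i] for i in range(m)]
    q = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                q[i][j] = 1 - mu[i] ** 2          # f_i^2 = 1 for +/-1 labels
            else:
                # Conditional independence given Y factorizes E[f_i f_j | Y].
                second = p * pos[i] * pos[j] + (1 - p) * neg[i] * neg[j]
                q[i][j] = second - mu[i] * mu[j]
    return q


# Made-up sensitivities, specificities and class imbalance.
psi = [0.9, 0.7, 0.6, 0.4]
eta = [0.8, 0.75, 0.5, 0.45]
b = 0.3

Q = population_covariance(psi, eta, b)
pi = [(s + e) / 2 for s, e in zip(psi, eta)]
max_gap = max(
    abs(Q[i][j] - (2 * pi[i] - 1) * (2 * pi[j] - 1) * (1 - b * b))
    for i in range(4) for j in range(4) if i != j
)
print(max_gap)  # difference between the exact covariance and formula (5)
```

The gap is at the level of floating-point rounding, which is what the lemma predicts for any choice of the inputs.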

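In practice, neither $Q$ nor $v$ is known, but both can be estimated from the finite unlabeled dataset
$D$. We denote the corresponding sample covariance matrix by $\hat{Q}$. Its entries are

$$\hat{q}_{ij} = \frac{1}{S - 1} \sum_{k=1}^{S} (f_i(x_k) - \hat{\mu}_i)(f_j(x_k) - \hat{\mu}_j),$$

where $\hat{\mu}_i = \frac{1}{S} \sum_k f_i(x_k)$. Under our assumptions, $\hat{Q}$ is an unbiased estimate of $Q$, namely $E[\hat{Q}] = Q$.
Moreover, the variances of the off-diagonal entries of the sample covariance matrix $\hat{Q}$ are

$$\mathrm{Var}[\hat{q}_{ij}] = \frac{(1 - \mu_i^2)(1 - \mu_j^2)}{S - 1} + \frac{1}{S} \left( 4 \mu_i \mu_j q_{ij} - \frac{S - 2}{S - 1} q_{ij}^2 \right). \tag{9}$$

Finally, $\hat{Q} \to Q$ as $S \to \infty$. Hence, for a sufficiently large unlabeled set $D$, it should be possible to
accurately estimate the ranking of the $M$ classifiers from $\hat{Q}$.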

In fact, the discussion above suggests several ways to rank the $M$ classifiers. One option is to look
for a rank-one matrix $R = \lambda v v^T$ whose off-diagonal terms are closest to those of $\hat{Q}$. While the rank-one
constraint is non-convex, its standard relaxation to a trace constraint yields

$$\hat{R} = \arg\min_{R} \sum_{i \ne j} (\hat{q}_{ij} - r_{ij})^2 + \theta \, \mathrm{Trace}(R) \tag{10}$$

subject to $R = R^T$ and $R \succeq 0$. This is a convex problem, which can be solved efficiently (in polynomial
time in $M$) via semi-definite programming [19].



An alternative and more computationally efficient approach is to construct an estimator $\hat{R}$ of the rank-one matrix $R$, and
then compute its leading eigenvector $\hat{v}$. Given that $E[\hat{Q}] = Q$, we estimate the off-diagonal entries of $R$
by those of $\hat{Q}$. As for the diagonal entries, note that upon writing the diagonal entries of the rank-one matrix as $r_{ii} = e^{2 t_i}$, for all $i \ne j$

$$\log |q_{ij}| - t_i - t_j = 0.$$

In the finite sample setting, we replace the unknown $q_{ij}$ by $\hat{q}_{ij}$ and look for an $M$-dimensional vector
$\hat{t}$ such that the relation above holds approximately for all pairs $i \ne j$,

$$\hat{t} = \arg\min_{t} \sum_{i} \sum_{j > i} (\log |\hat{q}_{ij}| - t_i - t_j)^2. \tag{11}$$

The vector $\hat{t}$ is efficiently found by solving an $M \times M$ system of linear equations. Since $\hat{q}_{ij} \to q_{ij}$ as
$S \to \infty$, it follows that $\hat{t}$ is an asymptotically consistent estimate of $t$, and consequently the resulting $\hat{v}$
is a consistent estimate of $v$.

In practice, to avoid the singularity at zero of the logarithm function, we modify (11) by summing
only over indices $i, j$ for which $|\hat{q}_{ij}| > 2 \sqrt{\widehat{\mathrm{Var}}[\hat{q}_{ij}]}$, where $\widehat{\mathrm{Var}}[\hat{q}_{ij}]$ is a plug-in estimator of (9). Once $\hat{t}$
is found, we construct the estimate $\hat{R}$ and rank the $M$ classifiers by its leading eigenvector $\hat{v}$.
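The least-squares step (11) and the subsequent ranking can be sketched as follows. This is a minimal illustration under stated assumptions: all pairs $i \ne j$ are kept (no thresholding of small entries), in which case the normal equations of (11) have the closed-form solution used in `estimate_diagonal`; the eigenvector is obtained by plain power iteration; and the sign is fixed by assuming most classifiers are better than random. The function names and the toy matrix are hypothetical.

```python
import math


def estimate_diagonal(q):
    """Least-squares fit of log|q_ij| = t_i + t_j over all pairs i != j (M >= 3)."""
    m = len(q)
    # s_i = sum over j != i of log|q_ij|; the normal equations read (M-2) t_i + sum(t) = s_i.
    s = [sum(math.log(abs(q[i][j])) for j in range(m) if j != i) for i in range(m)]
    total = sum(s) / (2 * (m - 1))                 # equals sum of all t_i
    t = [(s_i - total) / (m - 2) for s_i in s]
    return [math.exp(2 * t_i) for t_i in t]        # diagonal entries of the rank-one fit


def leading_eigenvector(a, iterations=200):
    """Power iteration for a symmetric matrix; sign chosen so the entries sum to >= 0."""
    m = len(a)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v if sum(v) >= 0 else [-x for x in v]


# Toy population matrix: off-diagonal entries follow Lemma 2.1, while the
# diagonal is deliberately not rank-one (all values are illustrative).
pi = [0.85, 0.70, 0.60, 0.55, 0.40]
b = 0.2
w = [2 * p - 1 for p in pi]
m = len(pi)
q = [[1.0 if i == j else w[i] * w[j] * (1 - b * b) for j in range(m)] for i in range(m)]

diag = estimate_diagonal(q)
r = [[diag[i] if i == j else q[i][j] for j in range(m)] for i in range(m)]
v = leading_eigenvector(r)
ranking = sorted(range(m), key=lambda i: v[i], reverse=True)
print(ranking)  # [0, 1, 2, 3, 4]
```

With exact population entries the corrected matrix is exactly rank one, so the eigenvector recovers the ordering of the balanced accuracies, including the classifier that is worse than random.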

Finally, an even simpler approach is to rank the classifiers by directly computing the leading eigenvector
of $\hat{Q}$. For a finite number of classifiers $M$, it follows from Lemma 2.1 that as $S \to \infty$ this approach is
in general not consistent. However, as the following lemma shows, if $M$ is large this leading eigenvector
is close to the true one.



Lemma 2.2. Let $w$ be the leading unit-norm eigenvector of the population matrix $Q$, and let $\lambda$ be given
by (7). Then,

$$(w^T v)^2 \ge 1 - \frac{1}{\lambda}. \tag{12}$$






A proof of Lemma 2.2 is provided in the Supplementary Information. Note that if all classifiers in
the ensemble have a balanced accuracy bounded away from 1/2, then $\lambda = O(M)$ and the angle between
$v$ and $w$ is small.

Ranking classifiers by a singular value decomposition of the $S \times M$ matrix of predicted labels $f_i(x_k)$
was recently suggested in [6], where the $j$-th entry in the leading right singular vector was considered
a proxy for the reliability of the $j$-th classifier. Our work provides a novel probabilistic interpretation
to their approach, as it shows that the entries of $w$ (also the leading right singular vector of the matrix
$f_i(x_k)$) are approximately those of $v$, which in turn are proportional to the balanced accuracies of the
classifiers. Consistent with the analysis above, in our simulations we found that all three approaches
(SDP (10), least-squares problem (11) and direct eigen-decomposition of $\hat{Q}$) gave comparable rankings,
though the latter was slightly less accurate.



3 The Spectral Meta Learner (SML) 

Next, we turn to the problem of constructing a meta-learner expected to be more accurate than any of
the $M$ classifiers in the ensemble. In our setting, this is equivalent to estimating the $S$ unknown labels
$y_1, \dots, y_S$ by combining the labels predicted by the $M$ classifiers.

The standard approach to this task is to determine for all the unlabeled instances the maximum
likelihood estimator (MLE) $\hat{y}^{(ML)}$ of their true class labels $y$ [2]. Under the assumption of independence
between classifier errors and independence between instances, the overall likelihood is the product of the
likelihoods of the $S$ individual instances, where the likelihood of a label $y$ for an instance $x$ is

$$\mathcal{L}(f_1(x), \dots, f_M(x); y) = \prod_{i=1}^{M} \Pr[f_i(x) \mid y]. \tag{13}$$

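As shown in the Supplemental Information, the MLE can be written as a weighted sum of the binary
labels $f_i(x) \in \{-1, 1\}$, with weights that depend on the sensitivities $\psi_i$ and specificities $\eta_i$ of the classifiers.
For an instance $x$,

$$\hat{y}^{(ML)} = \arg\max_{y} \ln \mathcal{L}(f_1(x), \dots, f_M(x); y) = \mathrm{sign}\left( \sum_{i=1}^{M} \left( f_i(x) \log \alpha_i + \log \beta_i \right) \right) \tag{14}$$

with

$$\alpha_i = \frac{\psi_i \eta_i}{(1 - \psi_i)(1 - \eta_i)}, \qquad \beta_i = \frac{\psi_i (1 - \psi_i)}{\eta_i (1 - \eta_i)}. \tag{15}$$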

Equation (14) shows that the MLE is a linear ensemble classifier, whose weights depend, unfortunately, 
on the unknown specificities and sensitivities of the M classifiers. 
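To make the weighted-vote form of (14) and (15) concrete, the sketch below evaluates the rule for a single instance when the sensitivities and specificities are assumed known, which is not the case in the unsupervised setting considered here. The function name `mle_label`, the tie-breaking choice at a zero score, and the numerical values are illustrative assumptions.

```python
import math


def mle_label(votes, psi, eta):
    """Rule (14): sign of sum_i (f_i * log(alpha_i) + log(beta_i))."""
    score = 0.0
    for f, s, e in zip(votes, psi, eta):
        alpha = (s * e) / ((1 - s) * (1 - e))      # alpha_i in (15)
        beta = (s * (1 - s)) / (e * (1 - e))       # beta_i in (15)
        score += f * math.log(alpha) + math.log(beta)
    return 1 if score >= 0 else -1                 # a zero score is mapped to +1 here


# One strong classifier and two weak ones (made-up values).
psi = [0.9, 0.6, 0.6]
eta = [0.9, 0.6, 0.6]

# The strong classifier is outvoted 2 to 1, yet the likelihood rule follows it.
print(mle_label([1, -1, -1], psi, eta))   # 1
print(mle_label([-1, 1, 1], psi, eta))    # -1
```

In this example a plain majority vote would return the opposite label in both cases, which illustrates why the weights matter.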

The common approach, pioneered by Dawid and Skene [2], is to jointly maximize the likelihood of
all $S$ labels and the specificities and sensitivities of the $M$ classifiers. Given an estimate of the true
class labels, it is straightforward to estimate each classifier's sensitivity and specificity. Similarly, given
estimates of $\psi_i$ and $\eta_i$, the corresponding estimates of $y$ are easily found via (14). Hence, the MLE is
typically approximated by expectation-maximization (EM) [8-11, 13].



As is well known, the EM procedure is guaranteed to increase the likelihood at each iteration. However, 
its key limitation is that since the likelihood is in general a non-convex function, the EM iterations may 
converge to a local (rather than global) maximum of the likelihood function. 






Importantly, the EM procedure requires an initial guess of the ground truth labels $y$. A common choice
is the simple majority voting rule of the ensemble of classifiers. As noted in previous studies, majority
voting may be highly suboptimal, and starting the EM procedure from it may lead to suboptimal local
maxima [13]. Thus, it is desirable, and as described below in some cases crucial, to initialize the EM
algorithm with an estimate $\hat{y}$ that is close to the true class label $y$.
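The dependence on the initial guess can be seen in a very small example. The sketch below is a simplified, hard-assignment variant of the iterative procedure (labels are re-estimated with rule (14) rather than with soft posteriors), with add-one smoothing of the sensitivity and specificity estimates; both simplifications are assumptions made for brevity and this is not the exact algorithm of Dawid and Skene. The name `hard_em` and the vote matrix are hypothetical.

```python
import math


def hard_em(votes, init_labels, iterations=20):
    """votes[i][k] in {-1, +1}: prediction of classifier i on instance k."""
    labels = list(init_labels)
    for _ in range(iterations):
        # Parameter step: smoothed sensitivity and specificity given current labels.
        n_pos = sum(1 for y in labels if y == 1)
        n_neg = len(labels) - n_pos
        weights, offsets = [], []
        for row in votes:
            tp = sum(1 for f, y in zip(row, labels) if y == 1 and f == 1)
            tn = sum(1 for f, y in zip(row, labels) if y == -1 and f == -1)
            psi = (tp + 1) / (n_pos + 2)
            eta = (tn + 1) / (n_neg + 2)
            weights.append(math.log(psi * eta / ((1 - psi) * (1 - eta))))
            offsets.append(math.log(psi * (1 - psi) / (eta * (1 - eta))))
        # Label step: relabel each instance with the weighted vote of rule (14).
        new_labels = []
        for k in range(len(labels)):
            score = sum(w * row[k] for w, row in zip(weights, votes)) + sum(offsets)
            new_labels.append(1 if score >= 0 else -1)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Two classifiers that agree with each other and a third that disagrees twice.
a = [1, 1, 1, 1, -1, -1, -1, -1]
c = [1, 1, 1, -1, -1, -1, -1, 1]
votes = [a, list(a), c]

from_majority = hard_em(votes, a)   # the majority vote equals `a` here
from_c = hard_em(votes, c)          # start from the dissenting classifier instead
print(from_majority == a, from_c == c)  # True True
```

Both starting points are fixed points of the iteration, so the final labeling is determined entirely by the initialization in this toy case.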

In this section we show that it is indeed possible to construct a more accurate initial guess, using
the eigenvector of the previous section. To this end we note that a Taylor expansion of the unknown
coefficients $\alpha_i$ and $\beta_i$ in (15) around $(\psi_i, \eta_i) = (1/2, 1/2)$ gives, up to second-order terms
$O((\psi_i - 1/2)^2, (\eta_i - 1/2)^2, (\psi_i - 1/2) \cdot (\eta_i - 1/2))$,

$$\alpha_i \approx 1 + 4(\psi_i + \eta_i - 1), \qquad \beta_i \approx 1. \tag{16}$$

Next, recall that the balanced accuracy is $\pi_i = (\psi_i + \eta_i)/2$. Hence, combining a Taylor expansion of (14)
around $(\psi_i, \eta_i) = (1/2, 1/2)$ with (16) and keeping only first-order terms yields

$$\hat{y}_k^{(ML)} \approx \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x_k)(2\pi_i - 1) \right). \tag{17}$$



Recall that by Lemma 2.1, up to a sign ambiguity, the entries of the leading eigenvector of $R$ are
proportional to the balanced accuracies of the classifiers, $v_i \propto (2\pi_i - 1)$. This sign ambiguity can be easily
removed if we assume, for example, that most classifiers are better than random. Replacing $2\pi_i - 1$ in
(17) by the eigenvector entries $\hat{v}_i$ of an estimate of $R$ yields a novel spectral-based ensemble classifier,
which we term the Spectral Meta Learner (SML),

$$\hat{y}_k^{(SML)} = \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x_k) \cdot \hat{v}_i \right). \tag{18}$$




As we shall see in the simulation section, the SML is typically more accurate than majority voting, and 
provides a better initial guess for EM procedures that estimate the MLE. 
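A compact end-to-end sketch of rule (18) on synthetic data is given below. For brevity it takes the weights from the leading eigenvector of the uncorrected sample covariance, i.e. the simplest of the three estimators discussed in Section 2, computed by power iteration; the sign is fixed by assuming that most classifiers are better than random. The function names, the ensemble accuracies and the random seed are illustrative assumptions, not the simulation setup of Section 5.

```python
import math
import random


def sample_covariance(votes):
    """votes[i][k] in {-1, +1}; returns the M x M unbiased sample covariance."""
    m, s = len(votes), len(votes[0])
    means = [sum(row) / s for row in votes]
    return [[sum((votes[i][k] - means[i]) * (votes[j][k] - means[j])
                 for k in range(s)) / (s - 1)
             for j in range(m)] for i in range(m)]


def leading_eigenvector(a, iterations=500):
    """Power iteration; the sign is fixed so that the entries sum to >= 0."""
    m = len(a)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v if sum(v) >= 0 else [-x for x in v]


def sml_predict(votes):
    """Rule (18): sign of the eigenvector-weighted sum of the votes."""
    v = leading_eigenvector(sample_covariance(votes))
    n = len(votes[0])
    scores = [sum(v[i] * votes[i][k] for i in range(len(votes))) for k in range(n)]
    return [1 if x >= 0 else -1 for x in scores], v


# Synthetic ensemble with independent errors (all values are assumptions).
random.seed(0)
accuracies = [0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.45]
truth = [random.choice([-1, 1]) for _ in range(2000)]
votes = [[y if random.random() < acc else -y for y in truth] for acc in accuracies]

labels, weights = sml_predict(votes)
sml_accuracy = sum(p == y for p, y in zip(labels, truth)) / len(truth)
majority = [1 if sum(col) >= 0 else -1 for col in zip(*votes)]
majority_accuracy = sum(p == y for p, y in zip(majority, truth)) / len(truth)
print(sml_accuracy, majority_accuracy)
```

The true labels are used only to score the two meta-learners at the end; `sml_predict` itself sees nothing but the vote matrix.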



4 Learning in the Presence of a Malicious Cartel 

We now consider a scenario whereby for some $r \in [0, 1/2)$, $r \cdot M$ classifiers belong to a conspiring cartel
(e.g. representing a junta or an interest group), maliciously designed to veer the ensemble solution
toward the cartel's target and away from the truth. The possibility of such a scenario raises the following
question: how sensitive are SML and majority voting to the presence of a cartel? In other words, to
what extent can these methods remove, or at least substantially reduce the effect of the cartel classifiers,
without knowing their identity or applying sophisticated clustering algorithms to identify them?

To this end, let us first introduce some notation. Let the ensemble of $M$ classifiers be composed of a
set $P$ of $(1 - r)M$ "honest" classifiers and a set $C$ of $rM$ malicious cartel classifiers. The honest classifiers
satisfy the assumptions of the previous section: each classifier attempts to correctly predict the truth
with a balanced accuracy $\pi_i$, and different classifiers make independent errors. The cartel classifiers,
in contrast, attempt to predict a different target labeling, $T$. We assume that conditional on both the
cartel's target and the true label, the classifiers in the cartel make independent errors. Namely, for all
$i, j \in C$, and for any labels $a_i, a_j \in \{-1, 1\}$,

$$\Pr[f_i(X) = a_i, f_j(X) = a_j \mid T, Y] = \Pr[f_i(X) = a_i \mid T] \, \Pr[f_j(X) = a_j \mid T]. \tag{19}$$

Finally, we assume that the prediction errors of cartel and honest classifiers are also independent. 
The following lemma, proven in the supplementary information, characterizes the relation between
the population matrix $Q$ and the following quantities: the balanced accuracies of the $M$ classifiers, the
balanced accuracy $\pi_c$ of the cartel's target with respect to the truth, and the balanced accuracies $\xi_j$ of
the $r \cdot M$ cartel members relative to their target:

Lemma 4.1. Given $(1 - r)M$ honest classifiers and $r \cdot M$ classifiers of a cartel $C$, the entries of $Q$
satisfy

$$q_{ij} = \begin{cases} 1 - \mu_i^2 & i = j \\ (2\pi_i - 1)(2\pi_j - 1)(1 - b^2) & i \in P, \ j \in P \\ (2\pi_i - 1)(2\pi_c - 1)(2\xi_j - 1)(1 - b^2) & i \in P, \ j \in C \\ (2\xi_i - 1)(2\xi_j - 1)(1 - b^2) & i \in C, \ j \in C \end{cases} \tag{20}$$

where $b \in (-1, 1)$ is the class imbalance, as in (6).

Based on Lemma 4.1, the following theorem shows that in the presence of a single cartel, the off-diagonal
entries of $Q$ correspond to a rank-two matrix. We conjecture that in the presence of $k$ independent cartels,
the respective rank is $(k + 1)$.

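Theorem 4.2. Given $(1 - r)M$ honest classifiers and $rM$ classifiers belonging to a cartel, $0 < r < 1$,
the off-diagonal entries of $Q$ correspond to a rank-two matrix with eigenvalues

$$\lambda_1 = \lambda_P \cos^2 \alpha + \lambda_C \sin^2 \beta, \qquad \lambda_2 = \lambda_P \sin^2 \alpha + \lambda_C \cos^2 \beta \tag{21}$$

and eigenvectors

$$e_{1i} \propto \begin{cases} (2\pi_i - 1) \cos \alpha & i \in P \\ (2\xi_i - 1) \sin \beta & i \in C \end{cases} \tag{22}$$

$$e_{2i} \propto \begin{cases} (2\pi_i - 1) \sin \alpha & i \in P \\ (2\xi_i - 1) \cos \beta & i \in C \end{cases} \tag{23}$$

where

$$\lambda_P = (1 - b^2) \sum_{j \in P} (2\pi_j - 1)^2, \qquad \lambda_C = (1 - b^2) \sum_{j \in C} (2\xi_j - 1)^2, \tag{24}$$

and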



( mI^Fi )' /3= Cretan (^=|) (25) 
$$k_1 = 2\pi_c - 1, \qquad k_2 = \lambda_C / \lambda_P. \tag{26}$$



a = 5 arctan 



An intuitive interpretation of Theorem 4.2 is that the covariance matrix $Q$ describes a two-dimensional
subspace. The honest classifiers lie on a line with angle $\alpha$ relative to the eigenvector $e_1$. The cartel
classifiers lie on a line with angle $\beta$ relative to the eigenvector $e_2$.

As an illustrative example, we consider the case where the cartel's target is uninformative with respect
to the truth, i.e. $\pi_c = 1/2$. In this case $\alpha = \beta = 0$, so $\lambda_1 = \lambda_P$, $\lambda_2 = \lambda_C$ and

$$e_{1i} \propto \begin{cases} 2\pi_i - 1 & i \in P \\ 0 & i \in C \end{cases} \tag{27}$$

$$e_{2i} \propto \begin{cases} 0 & i \in P \\ 2\xi_i - 1 & i \in C \end{cases} \tag{28}$$
Recall from (18) that SML weighs each classifier by the corresponding entry in the leading eigenvector.
Hence, if the cartel's target is orthogonal to the truth ($\pi_c = 1/2$) and $\lambda_P > \lambda_C$, SML asymptotically
ignores the cartel (Fig. S1). In contrast, regardless of $\pi_c$, majority voting is affected by the cartel,
proportionally to its fraction size $r$. Hence, SML is much more robust than majority voting to the
presence of a cartel.
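The case of an uninformative cartel target can be reproduced at the population level. The sketch below builds the matrix of Lemma 4.1 with $\pi_c = 1/2$, assuming its diagonal has already been replaced by the low-rank values, and computes the leading eigenvector by power iteration. The accuracies, the ensemble size and the helper names are illustrative assumptions.

```python
import math


def leading_eigenvector(a, iterations=200):
    """Power iteration for a symmetric matrix; sign chosen so the entries sum to >= 0."""
    m = len(a)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v if sum(v) >= 0 else [-x for x in v]


honest = [0.80, 0.75, 0.70, 0.65]   # balanced accuracies pi_i w.r.t. the truth
cartel = [0.70, 0.70]               # accuracies xi_j w.r.t. the cartel's target
pi_c = 0.5                          # cartel target uninformative about the truth
b = 0.0                             # balanced classes

u = [2 * p - 1 for p in honest] + [2 * x - 1 for x in cartel]
in_cartel = [False] * len(honest) + [True] * len(cartel)
k1 = 2 * pi_c - 1


def entry(i, j):
    """Entries of Lemma 4.1; the diagonal uses the low-rank value u_i^2 (1 - b^2)."""
    base = u[i] * u[j] * (1 - b * b)
    return base if in_cartel[i] == in_cartel[j] else base * k1


m = len(u)
r = [[entry(i, j) for j in range(m)] for i in range(m)]
lambda_p = (1 - b * b) * sum(x * x for x, c in zip(u, in_cartel) if not c)
lambda_c = (1 - b * b) * sum(x * x for x, c in zip(u, in_cartel) if c)
v = leading_eigenvector(r)
print(lambda_p > lambda_c, [round(x, 3) for x in v])
```

Since $\lambda_P > \lambda_C$ here, the cartel members receive essentially zero weight in the leading eigenvector, whereas in a majority vote they would hold two of the six votes.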



5 Results 

This section contains two parts. First, we study our ranking and SML algorithms on simulated data 
for both an ensemble of independent classifiers, and an ensemble of independent classifiers corrupted 
by the presence of one cartel. In the second part, using standard machine learning algorithms as our 
collection of binary classifiers, we evaluate our methods on several real datasets from medical, biological 
and engineering applications. This second part shows that our methods are robust to deviations from 
the (unrealistic) strict independence assumptions on errors between classifiers. 



5.1 Simulated Data 

In our simulations we considered an unlabeled test data of size S = 600 instances, a ground truth with 
class imbalance 6 = and an ensemble of M = 100 classifiers. Each classifier had potentially different 
sensitivity and specificity chosen at random such that its balanced accuracy was uniformly distributed 
on the interval [0.3,0.8]. This setup was chosen to imitate a difficult learning problem with independent 
classifiers, some of which are worse than random. We note that classifiers that are worse than random 
may occur in real studies, where the training data is too small in size or not sufficiently representative of 
the test data. Finally, we considered the effect of a malicious cartel consisting of 33% of the classifiers, 
having their own target labeling. More details about the simulations are provided in the supplementary 
information. 

Ranking of Classifiers: We constructed the sample covariance matrix, corrected its diagonal according
to (11) and computed its leading eigenvector $\hat{v}$. In both cases (independent classifiers and cartel), with
probability of at least 80%, the classifier with highest accuracy was also the one with the largest entry (in
absolute value) in the eigenvector $\hat{v}$, and with probability > 99% its inferred rank was among the top five
classifiers (Fig. S2). We remark that even if the test data of size $S = 600$ were fully labeled, identifying
the best performing classifier would still be prone to errors, since the estimated balanced accuracy has
itself an error of $O(1/\sqrt{S})$.

Unsupervised Ensemble-Learning: Next, for the same set of simulations we compared the balanced
accuracy of majority voting and of our suggested SML. We also considered the predictions of these two
meta-learners as starting points for iterative EM calculation of the MLE (iMLE). As shown in Fig. 1,
SML was significantly more accurate than majority voting. Furthermore, applying an EM procedure
with SML as an initial guess provided relatively small improvements in the balanced accuracy. Majority
voting, in contrast, was less robust. Moreover, in the presence of a cartel, computing the MLE with
majority voting as its starting point exhibited a multi-modal behavior, sometimes converging to a local
maximum with a relatively low balanced accuracy.

A more detailed study of the sensitivity of SML and majority voting and their respective improved
iMLE solutions to the size of a malicious cartel with $\pi_c = 0.5$ is shown in Fig. 2. As expected, the
average balanced accuracy of SML, voting or iMLE initialized using either voting or SML decreases as
a function of the cartel's fraction, $r$, and once the cartel's fraction is too large all methods fail. In our
simulations, both SML and iMLE initialized with it were far more robust to the size of the cartel, in
comparison to both majority voting and iMLE initialized with it. With a cartel size of 20%, SML was
still able to construct a nearly perfect predictor, whereas the balanced accuracy of majority voting and


[Figure 1: boxplots in two panels, "Independent classifiers" (left) and "Cartel (33% of classifiers)" (right).]

Figure 1: The Spectral Meta Learner (SML) is a good and robust meta-learner. The performance of 
SML is higher than that of majority voting (green) also in the presence of one cartel. In the presence of 
a cartel with target balanced accuracy of 0.5 (right panel), iMLE initialized with SML benefits from its 
robustness to cartels. In contrast, iMLE initialized using majority voting may converge to a poor local
maximum. The boxplots represent the distribution of balanced accuracies of 3000 independent runs.



iMLE initialized with it were both far from 1. Interestingly, the prediction of iMLE using SML as starting 
condition showed no significant improvement relative to the average balanced accuracy of SML itself. 

5.2 Real Datasets 

We applied our spectral approach to 8 common and publicly available datasets from several scientific and 
engineering applications. We used 33 standard machine-learning methods implemented in the software 
package Weka [32] as our suite of classifiers. Details on the datasets and the classifiers used appear in 
the Supplementary Information (Tables S1 and S2).

We split each dataset into a labeled part, and an unlabeled part, the latter serving as the test data 
D used to evaluate our methods. To best reproduce the problem setting of the student's dilemma, each 
algorithm had access only to a subset of the labeled data (i.e. each classifier was trained with a slightly 
different training set). For each of these eight datasets, the leading eigenvector of the modified covariance
matrix was highly concordant with the classifiers' balanced accuracies computed on the test set after
disclosure of the true class labels, regardless of potential dependencies between them, with a Kendall's
$\tau$ correlation typically higher than 0.9. One exception was the Abalone dataset, for which all classifiers
had poor accuracy and hence were difficult to rank. 

Next, we compared SML, majority voting and iMLE initialized at either of these two classifiers. As
seen in Fig. 3, consistently across all datasets, iMLE initialized with SML had a higher mean balanced
accuracy than iMLE initialized with majority voting. Furthermore, iMLE initialized with SML was more 
robust, with fewer outliers having low balanced accuracy. 
Figure 2: SML is more robust to cartels than majority voting (left panel). iMLE using SML estimates as
starting point is also more robust to cartels than iMLE using majority voting as the starting condition
(right panel). For each meta-learner prediction the average balanced accuracy is shown (filled lines)
together with the standard error (dotted lines, n=500 runs for each cartel's fraction).



6 Summary and Discussion 

In the present work, we developed an unsupervised spectral framework to rank the performances of binary 
classifiers and to combine their predictions into an ensemble spectral meta-learner, SML, that is easy to 
construct and fast to compute. We showed that SML is equivalent to linearization of the MLE around
$(\psi, \eta) = (1/2, 1/2)$. This is the only neighborhood where linearization of MLE is invariant to substitution
of the unknown balanced accuracies by the corresponding entries of the eigenvector v. Interestingly, we 
found that in most cases the prediction returned by iMLE starting from SML is only slightly better than 
the prediction obtained by SML itself, suggesting that the SML solution nearly coincides with a local 
maximum of the likelihood function. In addition, we showed that SML is robust to cartels. Finally, we 
illustrated the applicability of the proposed methods on data from real-world problems. 

Our work raises several interesting problems for future research. First, most of our analysis was
asymptotic, in the limit of an infinitely large unlabeled test set. A theoretical study of the effects of
a finite test set on the accuracy of the leading eigenvector is of interest. This is particularly relevant
in the crowdsourcing setting, where there is significant missing data in the prediction matrix $f_i(x_k)$. In
principle, an estimated covariance matrix can be computed by using the complete observations for each
pair of classifiers. However, perhaps an alternative approach of directly fitting a low rank matrix is more
suitable.
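The pairwise-complete estimate mentioned above can be sketched as follows, with missing predictions encoded as `None`. The function name `pairwise_covariance`, the encoding of missing values and the toy vote matrix are illustrative assumptions.

```python
def pairwise_covariance(votes):
    """Sample covariance using, for each pair, only instances both classifiers labeled."""
    m = len(votes)
    q = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            pairs = [(a, b) for a, b in zip(votes[i], votes[j])
                     if a is not None and b is not None]
            n = len(pairs)
            if n < 2:
                continue                      # not enough shared instances
            mean_a = sum(a for a, _ in pairs) / n
            mean_b = sum(b for _, b in pairs) / n
            q[i][j] = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (n - 1)
    return q


# Two annotators, five instances, one missing prediction each (made-up values).
votes = [
    [1, -1, 1, -1, None],
    [1, -1, None, -1, 1],
]
q = pairwise_covariance(votes)
print(q[0][1])  # covariance over the three instances labeled by both
```

A matrix assembled this way is not guaranteed to be positive semi-definite, which is one reason a direct low-rank fit may be preferable when much of the data is missing.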

A natural extension of the present work is to analyze problems where the response class label is
categorical rather than binary (multi-class problems), or even continuous (regression problems). We expect
that even in these problems the covariance matrix of the predictions of independent classifiers (or
independent regressors) is a perturbation of a low-rank matrix. A modified covariance matrix, similar to the
one proposed in our study, may improve the quality of existing methods.

The quality of predictions may also be improved by taking into consideration instance difficulty,
discussed in previous studies [8, 13]. In these studies there is an assumption that some instances are
harder to classify correctly, independent of the classifier employed, with different analytic formulations
proposed to model this difficulty. In our context, both very easy examples (on which all classifiers agree)
and very difficult ones (on which classifier predictions are as good as random) are not useful for ranking
[Figure 3 title: Performance of SML and MLE compared to the inferred best predictor]
Figure 3: Comparison of several classifiers on all eight datasets. Compared to MLE from voting, SML
and MLE from it were overall more robust with fewer cases of low balanced accuracy, and in some
datasets (PD, AD) achieved a significantly higher median balanced accuracy. For each dataset, the boxplots
represent the distribution of balanced accuracies across 1000 independent runs.



the different classifiers. This suggests that in the presence of instance difficulty, if it can somehow be 
estimated, then it may be profitable to rank the classifiers by stratifying the data and removing these 
very easy or very hard samples. On the theoretical front, incorporating instance difficulty into our model 
may require additional and more restrictive assumptions concerning the independence between classifiers 
and between instances, for example at each difficulty level. 

The current formulation provides no measure of the confidence of class-label assignment using SML. 
A relaxation of Eq. 18, obtained by considering the argument of the sign operator, can be used to assess 
the confidence of the class assignment of each instance. This formulation can be used with performance 
measures such as the Area Under the Receiver Operating Characteristic Curve. 
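
As a minimal sketch of this relaxation, the weighted sum inside the sign operator can be kept as a real-valued score and summarized by the area under the ROC curve. The function names `sml_scores` and `auc`, the toy votes and the weights below are assumptions made for illustration; they are not taken from the study, and the true labels are used only to evaluate the score.

```python
def sml_scores(predictions, weights):
    """Argument of the sign operator: one real-valued score per instance.

    predictions[i][k] is the +1/-1 vote of classifier i on instance k;
    weights[i] is the eigenvector entry assigned to classifier i.
    """
    n_instances = len(predictions[0])
    return [sum(w * f[k] for w, f in zip(weights, predictions))
            for k in range(n_instances)]


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: made-up votes of three classifiers on six instances.
votes = [
    [1, 1, -1, -1, 1, -1],
    [1, -1, -1, 1, 1, -1],
    [-1, 1, 1, -1, 1, -1],
]
v = [0.8, 0.5, 0.1]                    # hypothetical eigenvector entries
truth = [1, 1, -1, -1, 1, -1]          # used for evaluation only

scores = sml_scores(votes, v)
labels_hat = [1 if s > 0 else -1 for s in scores]
print(scores, labels_hat, auc(scores, truth))
```

The magnitude of each score can be read as a confidence: instances on which the highly weighted classifiers agree receive scores far from zero.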

In the present work we also introduced the notion of cartels. The ability to identify such groups 
and their target, as well as to ignore their contributions, is of critical importance in many practical 
applications, such as electoral committees and decision-making in trading. We showed how the SML 
prediction asymptotically ignores moderately sized cartels. We conjecture that such a construction is 
possible for $\pi_c \approx 1/2$ even when the honest predictors are a minority. In this scenario 
$\lambda_C > \lambda_P$, thus the SML prediction should be constructed using the eigenvector associated 
with the second eigenvalue, $\lambda_P$ in this case. 



Materials and Methods 

Datasets and Classifiers 



In the present study we used 8 datasets for binary classification problems. With the exception of the 
Yale breast cancer dataset [22], these datasets were obtained from the public UCI repository [23]. Details 
on each dataset are provided in the Supplemental Information. The classifiers used in the present study 
have been previously described [24] or have been implemented in the software package Weka [32]. 



Statistical Analysis and Visualization 

Statistical analysis and visualization of results have been performed using MATLAB (2012a, The Math- 
Works, Natick, MA). Visualization of distributions has been performed using boxplots [25] . 
Supplementary Information: The student's dilemma: ranking 
and improving prediction at test time without access to training 
data 

Fabio Parisi 1 '*, Francesco Strino 1 ^, Boaz Nadler 2 , Yuval Kluger 1 '* 



1 Covariance between different classifiers 



Proof of Lemma 2.1. To prove the lemma we first compute the mean $\mu_i = E[f_i(X)]$ and the variance 
$\mathrm{Var}[f_i(X)]$ of the i-th classifier. We then use these results to compute the entries of the population 
covariance matrix, $q_{ij} = E[(f_i(X) - \mu_i)(f_j(X) - \mu_j)]$. 

Under the assumption of independence between instances, the population mean $\mu_i = E[f_i(X)]$ of the 
i-th classifier is 

$$
\begin{aligned}
E[f_i(X)] &= \Pr[f_i(X) = 1] - \Pr[f_i(X) = -1] \\
&= \Pr[f_i(X) = 1 \mid Y = 1]\Pr[Y = 1] + \Pr[f_i(X) = 1 \mid Y = -1]\Pr[Y = -1] \\
&\quad - \Pr[f_i(X) = -1 \mid Y = 1]\Pr[Y = 1] - \Pr[f_i(X) = -1 \mid Y = -1]\Pr[Y = -1].
\end{aligned}
\tag{29}
$$

Using the definitions of sensitivity $\psi_i = \Pr[f_i(X) = 1 \mid Y = 1]$, specificity 
$\eta_i = \Pr[f_i(X) = -1 \mid Y = -1]$, and class imbalance $b = \Pr[Y = 1] - \Pr[Y = -1]$, the equation 
above can be expressed as follows, 

$$
\begin{aligned}
\mu_i = E[f_i(X)] &= \psi_i \frac{1+b}{2} + (1 - \eta_i)\frac{1-b}{2} - (1 - \psi_i)\frac{1+b}{2} - \eta_i \frac{1-b}{2} \\
&= \psi_i - \eta_i + b(\psi_i + \eta_i - 1) \\
&= 2\delta_i + b(2\pi_i - 1),
\end{aligned}
\tag{30}
$$

where $\pi_i = (\psi_i + \eta_i)/2$ and $\delta_i = (\psi_i - \eta_i)/2$. 

Similarly, the population variance of the i-th classifier is 

$$
\mathrm{Var}[f_i(X)] = E[f_i(X)^2] - E[f_i(X)]^2 = 1 - E[f_i(X)]^2 = 1 - \left(2\delta_i + b(2\pi_i - 1)\right)^2.
\tag{31}
$$

Next, we consider $E[f_i(X) \cdot f_j(X)]$. Under the assumption of independence of errors between different 
instances and between different classifiers, for $i \neq j$ 

$$
\begin{aligned}
E[f_i(X) \cdot f_j(X)] &= \Pr[f_i(X) = f_j(X)] - \Pr[f_i(X) = -f_j(X)] \\
&= \frac{1+b}{2}\psi_i\psi_j + \frac{1+b}{2}(1-\psi_i)(1-\psi_j) + \frac{1-b}{2}(1-\eta_i)(1-\eta_j) + \frac{1-b}{2}\eta_i\eta_j \\
&\quad - \frac{1+b}{2}\psi_i(1-\psi_j) - \frac{1+b}{2}(1-\psi_i)\psi_j - \frac{1-b}{2}\eta_i(1-\eta_j) - \frac{1-b}{2}(1-\eta_i)\eta_j.
\end{aligned}
\tag{32}
$$

Combining the three equations above yields that for $i \neq j$ 

$$
\begin{aligned}
E[f_i(X) \cdot f_j(X)] - E[f_i(X)] \cdot E[f_j(X)] &= (1 - b^2)(\psi_i + \eta_i - 1)(\psi_j + \eta_j - 1) \\
&= (1 - b^2)(2\pi_i - 1)(2\pi_j - 1).
\end{aligned}
\tag{33}
$$

Thus, the entry of the M x M covariance matrix between the M classifiers is 

$$
q_{ij} =
\begin{cases}
1 - \mu_i^2 & i = j \\
(2\pi_i - 1)(2\pi_j - 1)(1 - b^2) & \text{otherwise.}
\end{cases}
\tag{34}
$$

□ 
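
The closed form in Eq. 33 can be checked numerically. The sketch below is only an illustration: it enumerates the joint distribution of the label and two conditionally independent classifiers exactly, with no sampling, and compares the resulting covariance with $(1 - b^2)(2\pi_i - 1)(2\pi_j - 1)$. The function names and the parameter values are assumptions chosen for the example.

```python
from itertools import product


def exact_covariance(psi_i, eta_i, psi_j, eta_j, b):
    """Population covariance of two conditionally independent +/-1 classifiers.

    psi_* are sensitivities, eta_* specificities, b = Pr[Y=1] - Pr[Y=-1].
    """
    def p_vote(vote, y, psi, eta):
        # Pr[f = vote | Y = y] from the definitions of sensitivity and specificity.
        if y == 1:
            return psi if vote == 1 else 1 - psi
        return eta if vote == -1 else 1 - eta

    e_i = e_j = e_ij = 0.0
    for y, fi, fj in product((1, -1), repeat=3):
        p_y = (1 + b) / 2 if y == 1 else (1 - b) / 2
        p = p_y * p_vote(fi, y, psi_i, eta_i) * p_vote(fj, y, psi_j, eta_j)
        e_i += p * fi
        e_j += p * fj
        e_ij += p * fi * fj
    return e_ij - e_i * e_j


def lemma_covariance(psi_i, eta_i, psi_j, eta_j, b):
    """Closed form of Eq. 33: (1 - b^2)(2 pi_i - 1)(2 pi_j - 1)."""
    pi_i = (psi_i + eta_i) / 2
    pi_j = (psi_j + eta_j) / 2
    return (1 - b ** 2) * (2 * pi_i - 1) * (2 * pi_j - 1)


# Arbitrary illustrative values, not taken from the study.
args = (0.9, 0.6, 0.7, 0.8, 0.2)
print(exact_covariance(*args), lemma_covariance(*args))
```

Note that the covariance depends on the sensitivities and specificities only through the balanced accuracies, which is what makes the off-diagonal part of the matrix rank one.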
2 Direct eigendecomposition of the covariance matrix 

Proof of Lemma 2.2. Let $\lambda(Q)$ be the leading eigenvalue of $Q$ with corresponding unit-norm eigenvector 
$w$. Let $\lambda$ be the eigenvalue of the rank-one matrix $R$ with corresponding unit-norm eigenvector $v$. First, 
note that 

$$
Q = R + D
\tag{35}
$$

where $D$ is a diagonal matrix with entries 

$$
d_{ii} = 1 - \mu_i^2 - (1 - b^2)(2\pi_i - 1)^2.
$$

Hence $\|D\|_2 = \max_i |d_{ii}| \le 1$. It thus readily follows from Weyl's theorem that 

$$
|\lambda(Q) - \lambda| \le \|D\|_2 \le 1.
\tag{36}
$$

Now we multiply the eigenvector equation $Qw = \lambda(Q)w$ from the left by $w^T$, and insert the relation (35) 
to obtain that 

$$
\lambda(Q) = \lambda (w^T v)^2 + w^T D w.
$$

The lemma now follows by combining Eq. 36 with the bound $|w^T D w| \le 1$. □ 
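
The bound in Eq. 36 can be illustrated on a small population covariance matrix. The sketch below is an assumption-laden example, not the authors' code: it builds $Q$ from made-up sensitivities and specificities, computes the leading eigenvalue with a plain power iteration (chosen here only to keep the example dependency-free), and compares it with the single non-zero eigenvalue of the rank-one matrix, written `lam_R`.

```python
def power_iteration(matrix, n_iter=500):
    """Leading eigenpair of a symmetric matrix with a dominant positive eigenvalue."""
    n = len(matrix)
    w = [1.0 / n ** 0.5] * n
    for _ in range(n_iter):
        z = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in z) ** 0.5
        w = [x / norm for x in z]
    z = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(wi * zi for wi, zi in zip(w, z))  # Rayleigh quotient
    return eigenvalue, w


# Illustrative sensitivities, specificities and class imbalance (assumed values).
psi = [0.9, 0.8, 0.7, 0.6, 0.55]
eta = [0.8, 0.85, 0.6, 0.7, 0.5]
b = 0.2
M = len(psi)

mu = [p - e + b * (p + e - 1) for p, e in zip(psi, eta)]  # classifier means
s = [p + e - 1 for p, e in zip(psi, eta)]                 # 2 * pi_i - 1
u = 1 - b ** 2

# Population covariance Q and the rank-one matrix with entries u * s_i * s_j.
Q = [[1 - mu[i] ** 2 if i == j else u * s[i] * s[j] for j in range(M)]
     for i in range(M)]
lam_R = u * sum(x * x for x in s)   # only non-zero eigenvalue of the rank-one matrix
norm_D = max(abs(Q[i][i] - u * s[i] ** 2) for i in range(M))

lam_Q, w = power_iteration(Q)
print(lam_Q, lam_R, norm_D)
```

In this example the diagonal perturbation is close to its upper bound of 1 for the weakest classifier, which suggests why a matrix with a modified diagonal can be preferable to the raw covariance.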



3 Spectral Meta-Learner 

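In this section we present the derivation of the Spectral Meta-Learner (SML) as a linearization of the 
maximum likelihood estimator (MLE) of the vector of true class labels around $(\psi^*, \eta^*) = (1/2, 1/2)$. 

3.1 Maximum Likelihood Estimator (MLE) 

Under the assumption of independence between classifiers and instances, given the specificities and 
sensitivities of the M classifiers, the overall likelihood of all S class labels is a product of the likelihoods of 
each individual label. Hence, for each sample $x_k$ its true class label $y_k$ can be estimated independently 
of the other class labels. The MLE $\hat{y}_k^{(ML)}$ of $y_k$ is 

$$
\begin{aligned}
\hat{y}_k^{(ML)} &= \arg\max_{y_k \in \{1, -1\}} \log \mathcal{L}(f_1(x_k), \ldots, f_M(x_k); y_k) \\
&= \arg\max_{y_k} \left\{ \log \mathcal{L}(f_1(x_k), \ldots, f_M(x_k); y_k = 1),\ \log \mathcal{L}(f_1(x_k), \ldots, f_M(x_k); y_k = -1) \right\} \\
&= \mathrm{sign}\left( \log \mathcal{L}(f_1(x_k), \ldots, f_M(x_k); y_k = 1) - \log \mathcal{L}(f_1(x_k), \ldots, f_M(x_k); y_k = -1) \right) \\
&= \mathrm{sign}\left( \left( \sum_{i \mid f_i(x_k) = 1} \log \psi_i + \sum_{i \mid f_i(x_k) = -1} \log(1 - \psi_i) \right) - \left( \sum_{i \mid f_i(x_k) = 1} \log(1 - \eta_i) + \sum_{i \mid f_i(x_k) = -1} \log \eta_i \right) \right) \\
&= \mathrm{sign}\left( \sum_{i \mid f_i(x_k) = 1} \left( \log \psi_i - \log(1 - \eta_i) \right) + \sum_{i \mid f_i(x_k) = -1} \left( \log(1 - \psi_i) - \log \eta_i \right) \right).
\end{aligned}
$$

Next, recall that $f_i(x_k) \in \{-1, 1\}$. Hence, the conditions $f_i(x_k) = 1$ and $f_i(x_k) = -1$ in the two sums 
above can be replaced by the following two indicator functions, 

$$
\frac{1 + f_i(x_k)}{2} =
\begin{cases}
0 & f_i(x_k) = -1 \\
1 & f_i(x_k) = 1
\end{cases}
$$

and 

$$
\frac{1 - f_i(x_k)}{2} =
\begin{cases}
1 & f_i(x_k) = -1 \\
0 & f_i(x_k) = 1.
\end{cases}
$$

Using these indicator functions, we express the MLE as a function of $\psi_i$ and $\eta_i$ as follows 

$$
\begin{aligned}
\hat{y}_k^{(ML)} &= \mathrm{sign}\left( \sum_{i=1}^{M} \frac{1 + f_i(x_k)}{2} \left( \log \psi_i - \log(1 - \eta_i) \right) + \sum_{i=1}^{M} \frac{1 - f_i(x_k)}{2} \left( \log(1 - \psi_i) - \log \eta_i \right) \right) \\
&= \mathrm{sign}\left( \sum_{i=1}^{M} \left( f_i(x_k) \log \alpha_i + \log \beta_i \right) \right),
\end{aligned}
\tag{37}
$$

where 

$$
\alpha_i = \frac{\psi_i \eta_i}{(1 - \psi_i)(1 - \eta_i)}, \quad \text{and} \quad \beta_i = \frac{\psi_i (1 - \psi_i)}{\eta_i (1 - \eta_i)}.
\tag{38}
$$

The common factor 1/2 is dropped in the last step of Eq. 37 because it does not affect the sign. 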
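
As a minimal sketch of Eqs. 37 and 38, the weighted vote can be evaluated directly from known sensitivities and specificities and compared with the difference of the two log-likelihoods. The function names and the numerical values are assumptions made for illustration only.

```python
import math


def mle_label(votes, psi, eta):
    """MLE of one instance's label from the +1/-1 votes of M classifiers (Eq. 37)."""
    total = 0.0
    for f, p, e in zip(votes, psi, eta):
        log_alpha = math.log(p * e / ((1 - p) * (1 - e)))
        log_beta = math.log(p * (1 - p) / (e * (1 - e)))
        total += f * log_alpha + log_beta
    return 1 if total > 0 else -1


def mle_label_direct(votes, psi, eta):
    """Same estimate obtained by comparing the two log-likelihoods directly."""
    ll_pos = sum(math.log(p if f == 1 else 1 - p) for f, p in zip(votes, psi))
    ll_neg = sum(math.log(1 - e if f == 1 else e) for f, e in zip(votes, eta))
    return 1 if ll_pos > ll_neg else -1


# Made-up example: one accurate classifier outvotes two weak ones.
psi = [0.9, 0.6, 0.55]
eta = [0.85, 0.7, 0.5]
votes = [1, -1, -1]
print(mle_label(votes, psi, eta), mle_label_direct(votes, psi, eta))
```

In this toy case a simple majority vote would return -1, whereas the MLE follows the single accurate classifier because its weight $\log \alpha_i$ dominates.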

3.2 The SML: A first-order approximation of the MLE estimator 

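The maximum likelihood estimate (MLE) $\hat{y}_k^{(ML)}$ of the label of an instance $x_k$ is given by Eq. 37. 
The first-order Taylor expansions of its weights $\log \alpha_i$ and offsets $\log \beta_i$, around sensitivity 
and specificity values $(\psi^*, \eta^*)$, are given by 

$$
\log \alpha_i = \log \alpha^* + \frac{\psi_i - \psi^*}{\psi^* (1 - \psi^*)} + \frac{\eta_i - \eta^*}{\eta^* (1 - \eta^*)}
+ O\left( (\psi_i - \psi^*)^2, (\eta_i - \eta^*)^2, (\psi_i - \psi^*) \cdot (\eta_i - \eta^*) \right)
$$

$$
\log \beta_i = \log \beta^* + \frac{(1 - 2\psi^*)(\psi_i - \psi^*)}{\psi^* (1 - \psi^*)} - \frac{(1 - 2\eta^*)(\eta_i - \eta^*)}{\eta^* (1 - \eta^*)}
+ O\left( (\psi_i - \psi^*)^2, (\eta_i - \eta^*)^2, (\psi_i - \psi^*) \cdot (\eta_i - \eta^*) \right)
$$

where $\alpha^*$ and $\beta^*$ denote $\alpha_i$ and $\beta_i$ evaluated at $(\psi^*, \eta^*)$. 

At the specific values $(\psi^*, \eta^*) = (1/2, 1/2)$ we have $\log \alpha^* = \log \beta^* = 0$, the first-order 
term of $\log \beta_i$ vanishes, and $\log \alpha_i \approx 4(\psi_i + \eta_i - 1)$. Since a positive constant factor 
does not affect the sign, the Taylor expansion above simplifies to 

$$
\hat{y}_k^{(SML)} = \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x_k)(\psi_i + \eta_i - 1) \right)
= \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x_k)(2\pi_i - 1) \right)
= \mathrm{sign}\left( \sum_{i=1}^{M} f_i(x_k) v_i \right),
$$

where $v \in \mathbb{R}^M$ is the leading eigenvector of the modified covariance matrix, as described in the main text. 
We thus call this novel ensemble-classifier the Spectral Meta-Learner (SML). 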
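
To give a feel for the quality of this linearization, the sketch below compares the exact MLE weight $\log \alpha_i$ and offset $\log \beta_i$ with their first-order values around $(1/2, 1/2)$, namely $4(2\pi_i - 1)$ and 0. The function names and the test points are assumptions chosen for illustration.

```python
import math


def mle_weight(psi, eta):
    """Exact MLE weight log(alpha_i) and offset log(beta_i) of Eq. 38."""
    log_alpha = math.log(psi * eta / ((1 - psi) * (1 - eta)))
    log_beta = math.log(psi * (1 - psi) / (eta * (1 - eta)))
    return log_alpha, log_beta


def linearized_weight(psi, eta):
    """First-order expansion around (1/2, 1/2): 4 * (2 * pi - 1), with zero offset."""
    balanced_accuracy = (psi + eta) / 2
    return 4 * (2 * balanced_accuracy - 1)


for psi, eta in [(0.55, 0.52), (0.6, 0.65), (0.9, 0.8)]:
    log_alpha, log_beta = mle_weight(psi, eta)
    print(f"psi={psi}, eta={eta}: log_alpha={log_alpha:.3f}, "
          f"linearized={linearized_weight(psi, eta):.3f}, log_beta={log_beta:.3f}")
```

The two weights agree closely for classifiers near chance level and drift apart for very accurate classifiers, where the exact weight grows faster than the linear one.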
4 Covariance between different classifiers in presence of a cartel 

Proof of Lemma 4.1. As in the proof of Lemma 2.1, we first compute the mean and variance, $\mu_i = E[f_i(X)]$ 
and $\mathrm{Var}[f_i(X)]$ respectively, of the i-th classifier. We then use these results to compute the entries of 
the population covariance matrix, $q_{ij} = E[(f_i(X) - \mu_i) \cdot (f_j(X) - \mu_j)]$. 

The mean and variance for $i \in P$ have been computed in the proof of Lemma 2.1. We now focus on 
the mean and variance for $i \in C$. For brevity, in this section we will use the following notation: 

$$
\begin{aligned}
\rho_i &= \Pr[f_i(X) = 1 \mid T = 1] && i \in C \\
\nu_i &= \Pr[f_i(X) = -1 \mid T = -1] && i \in C \\
\psi_c &= \Pr[T = 1 \mid Y = 1] \\
\eta_c &= \Pr[T = -1 \mid Y = -1]
\end{aligned}
$$

Under the assumption of independence between instances, the population mean $\mu_i = E[f_i(X)]$ of the 
i-th classifier, for a cartel member $i \in C$, is 

$$
\begin{aligned}
E[f_i(X)] &= \Pr[f_i(X) = 1] - \Pr[f_i(X) = -1] \\
&= \Pr[f_i(X) = 1 \mid T = 1] \Pr[T = 1 \mid Y = 1] \Pr[Y = 1] \\
&\quad + \Pr[f_i(X) = 1 \mid T = -1] \Pr[T = -1 \mid Y = 1] \Pr[Y = 1] \\
&\quad + \Pr[f_i(X) = 1 \mid T = 1] \Pr[T = 1 \mid Y = -1] \Pr[Y = -1] \\
&\quad + \Pr[f_i(X) = 1 \mid T = -1] \Pr[T = -1 \mid Y = -1] \Pr[Y = -1] \\
&\quad - \Pr[f_i(X) = -1 \mid T = 1] \Pr[T = 1 \mid Y = 1] \Pr[Y = 1] \\
&\quad - \Pr[f_i(X) = -1 \mid T = -1] \Pr[T = -1 \mid Y = 1] \Pr[Y = 1] \\
&\quad - \Pr[f_i(X) = -1 \mid T = 1] \Pr[T = 1 \mid Y = -1] \Pr[Y = -1] \\
&\quad - \Pr[f_i(X) = -1 \mid T = -1] \Pr[T = -1 \mid Y = -1] \Pr[Y = -1]
\end{aligned}
\tag{41}
$$

which simplifies to 

$$
E[f_i(X)] = b\left(1 - \psi_c - \eta_c + \nu_i(\psi_c + \eta_c - 1) + \rho_i(\psi_c + \eta_c - 1)\right)
+ \nu_i(\psi_c - \eta_c - 1) + \rho_i(\psi_c - \eta_c + 1) + \eta_c - \psi_c.
\tag{42}
$$

Similarly, as previously shown, the population variance of the i-th classifier is 

$$
\mathrm{Var}[f_i(X)] = E[f_i(X)^2] - E[f_i(X)]^2 = 1 - E[f_i(X)]^2.
\tag{43}
$$

Next, we consider $E[f_i(X) \cdot f_j(X)]$. We remark that the case $i, j \in P$ was already considered in the 
proof of Lemma 2.1. Similarly, the case $i, j \in C$ is a special case of the proof of Lemma 2.1 when the 
truth is replaced by the cartel's target T. Thus, for these two cases, 

$$
\begin{aligned}
E[f_i(X) \cdot f_j(X)] - \mu_i \mu_j &= (2\pi_i - 1)(2\pi_j - 1)(1 - b^2) && i \neq j,\ i \in P,\ j \in P \\
E[f_i(X) \cdot f_j(X)] - \mu_i \mu_j &= (2\xi_i - 1)(2\xi_j - 1)(1 - b^2) && i \neq j,\ i \in C,\ j \in C
\end{aligned}
\tag{44}
$$

where $\xi_i = (\rho_i + \nu_i)/2$ is the balanced accuracy of cartel member i relative to the target T. 

We compute $E[f_i(X) \cdot f_j(X)]$ for the cross terms when $i \in P$ and $j \in C$. We define the balanced 
accuracy $\pi_c$ of the cartel's target T with respect to the truth, as well as its sensitivity $\psi_c$ and specificity 
$\eta_c$ with respect to the truth. Under the assumption of independence of errors between different instances 
and between different classifiers, for $i \in P$, $j \in C$ 

$$
\begin{aligned}
E[f_i(X) \cdot f_j(X)] &= \Pr[f_i(X) = f_j(X)] - \Pr[f_i(X) = -f_j(X)] \\
&= (2\psi_i - 1)\left((1 - 2\nu_j)(1 - \psi_c) - (1 - 2\rho_j)\psi_c\right)(1 + b)/2 \\
&\quad + (2\eta_i - 1)\left((1 - 2\rho_j)(1 - \eta_c) - (1 - 2\nu_j)\eta_c\right)(1 - b)/2.
\end{aligned}
\tag{45}
$$

Combining the three equations above yields that for $i \in P$, $j \in C$ 

$$
\begin{aligned}
E[f_i(X) \cdot f_j(X)] - E[f_i(X)] \cdot E[f_j(X)] &= (1 - b^2)(\psi_i + \eta_i - 1)(\psi_c + \eta_c - 1)(\nu_j + \rho_j - 1) \\
&= (1 - b^2)(2\pi_i - 1)(2\pi_c - 1)(2\xi_j - 1).
\end{aligned}
\tag{46}
$$

Thus, the entry of the M x M covariance matrix between the M classifiers is 

$$
q_{ij} =
\begin{cases}
1 - \mu_i^2 & i = j \\
(2\pi_i - 1)(2\pi_j - 1)(1 - b^2) & i \neq j,\ i \in P,\ j \in P \\
(2\xi_i - 1)(2\xi_j - 1)(1 - b^2) & i \neq j,\ i \in C,\ j \in C \\
(2\pi_i - 1)(2\pi_c - 1)(2\xi_j - 1)(1 - b^2) & i \in P,\ j \in C.
\end{cases}
\tag{47}
$$

□ 
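
The cross term between an honest classifier and a cartel member can be checked by exact enumeration. The sketch below is an illustration under stated assumptions: the honest classifier i depends only on the truth Y, the cartel member j depends on Y only through the target T, and `rho_j` and `nu_j` are names chosen here for the sensitivity and specificity of the cartel member relative to T. All numerical values are made up.

```python
from itertools import product


def bernoulli(vote, condition, sens, spec):
    """Pr[vote | condition] for a +/-1 variable with given sensitivity and specificity."""
    if condition == 1:
        return sens if vote == 1 else 1 - sens
    return spec if vote == -1 else 1 - spec


def cross_covariance(psi_i, eta_i, rho_j, nu_j, psi_c, eta_c, b):
    """Exact Cov(f_i, f_j): f_i follows the truth Y, f_j follows the target T."""
    e_i = e_j = e_ij = 0.0
    for y, t, fi, fj in product((1, -1), repeat=4):
        p = (1 + b) / 2 if y == 1 else (1 - b) / 2
        p *= bernoulli(t, y, psi_c, eta_c)      # target given truth
        p *= bernoulli(fi, y, psi_i, eta_i)     # honest classifier given truth
        p *= bernoulli(fj, t, rho_j, nu_j)      # cartel member given target
        e_i += p * fi
        e_j += p * fj
        e_ij += p * fi * fj
    return e_ij - e_i * e_j


def closed_form(psi_i, eta_i, rho_j, nu_j, psi_c, eta_c, b):
    """(1 - b^2)(psi_i + eta_i - 1)(psi_c + eta_c - 1)(rho_j + nu_j - 1)."""
    return ((1 - b ** 2) * (psi_i + eta_i - 1)
            * (psi_c + eta_c - 1) * (rho_j + nu_j - 1))


params = (0.8, 0.7, 0.75, 0.65, 0.6, 0.55, 0.3)
print(cross_covariance(*params), closed_form(*params))
```

When the target has balanced accuracy 1/2 with respect to the truth, the factor $(\psi_c + \eta_c - 1)$ vanishes and the honest and cartel blocks decouple.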



From Lemma 4.1 it follows that the matrix Q can be written as a block matrix 

$$
Q =
\begin{pmatrix}
Q_P & Q_{PC} \\
Q_{PC}^T & Q_C
\end{pmatrix}
\tag{48}
$$

where both $Q_P$ and $Q_C$ are rank one, and $Q_{PC}$ represents the interaction between classifiers in P with 
the classifiers in C. 



5 Matrix rank and eigendecomposition of the off-diagonal elements 
of the covariance between different classifiers in presence 
of a cartel 



Proof of Theorem 4.2. In the present proof, we simplify the notation using the following convenient 
change of variables: 

$$
\begin{aligned}
p_i &= 2\pi_i - 1 \\
r_i &= 2\xi_i - 1 \\
u &= (1 - b^2).
\end{aligned}
\tag{49}
$$

Analogously, we write $p_c = 2\pi_c - 1$ for the cartel's target. 

Suppose that the off-diagonal terms of the symmetric real-valued covariance matrix Q correspond to a 
rank-two matrix. Then we can write them as a linear combination of the outer products of two orthogonal 
vectors, $e_1$ and $e_2$ (the eigenvectors): 

$$
q_{ij} = \lambda_1 e_{1i} e_{1j} + \lambda_2 e_{2i} e_{2j}.
\tag{50}
$$

We parametrize these eigenvectors in a block form: 

$$
\sqrt{\lambda_1}\, e_{1i} =
\begin{cases}
\sqrt{u}\, a_{11} p_i & i \in P \\
\sqrt{u}\, a_{21} r_i & i \in C
\end{cases}
\tag{51}
$$

$$
\sqrt{\lambda_2}\, e_{2i} =
\begin{cases}
\sqrt{u}\, a_{12} p_i & i \in P \\
\sqrt{u}\, a_{22} r_i & i \in C
\end{cases}
\tag{52}
$$

Since the eigenvectors are orthogonal it follows that 

$$
\sum_{i \in P} a_{11} p_i\, a_{12} p_i + \sum_{j \in C} a_{21} r_j\, a_{22} r_j = 0.
\tag{53}
$$

Let us rewrite the matrix in Eq. 50 in a block matrix form by plugging in the block eigenvectors defined 
in Eq. 51 and Eq. 52: 

$$
\begin{aligned}
u\, p_i p_j &= u \cdot a_{11} p_i \cdot a_{11} p_j + u \cdot a_{12} p_i \cdot a_{12} p_j && i \in P,\ j \in P \\
u\, p_i p_c r_j &= u \cdot a_{11} p_i \cdot a_{21} r_j + u \cdot a_{12} p_i \cdot a_{22} r_j && i \in P,\ j \in C \\
u\, r_i r_j &= u \cdot a_{21} r_i \cdot a_{21} r_j + u \cdot a_{22} r_i \cdot a_{22} r_j && i \in C,\ j \in C
\end{aligned}
\tag{54}
$$

Hence, if $a_{11}$, $a_{12}$, $a_{21}$ and $a_{22}$ satisfy the following set of equations 

$$
\begin{aligned}
a_{11}^2 + a_{12}^2 &= 1 \\
a_{11} a_{21} + a_{12} a_{22} &= p_c \\
a_{21}^2 + a_{22}^2 &= 1 \\
(a_{11} a_{12}) / (a_{21} a_{22}) &= -\Big( \sum_{j \in C} r_j^2 \Big) \Big/ \Big( \sum_{i \in P} p_i^2 \Big)
\end{aligned}
\tag{55}
$$

then the left hand side of Eq. 50 is a rank-2 matrix. 

Following a change of variables, with $a_{11} = \cos\alpha$, $a_{12} = \sin\alpha$, $a_{21} = \sin\beta$, and 
$a_{22} = \cos\beta$, Eq. 55 reduces to 

$$
\begin{aligned}
\cos\alpha \sin\beta + \sin\alpha \cos\beta &= p_c \\
(\cos\alpha \sin\alpha) / (\sin\beta \cos\beta) &= -\Big( \sum_{j \in C} r_j^2 \Big) \Big/ \Big( \sum_{i \in P} p_i^2 \Big)
\end{aligned}
\tag{56}
$$

Next, we show that the system in Eq. 56 is determined and that it has a unique solution up to a 
rotation. 

We define $k_1 = p_c$ and $k_2 = \big( \sum_{j \in C} r_j^2 \big) / \big( \sum_{i \in P} p_i^2 \big)$, and simplify the 
system in Eq. 56: 

$$
\begin{aligned}
\sin(\alpha + \beta) &= k_1 \\
\sin(2\alpha) / \sin(2\beta) &= -k_2
\end{aligned}
\tag{57}
$$

Defining $\delta = \alpha + \beta$, we obtain 

$$
\sin\delta = k_1
\tag{58}
$$

$$
\cos\delta = \sqrt{1 - k_1^2}
\tag{59}
$$

$$
\sin(2\delta) = 2 k_1 \sqrt{1 - k_1^2}
\tag{60}
$$

$$
\cos(2\delta) = 1 - 2 k_1^2
\tag{61}
$$

We rewrite Eq. 57 as follows, solving for $\alpha$ 

$$
\sin(2\alpha) + k_2 \sin(2\delta - 2\alpha) = 0
\tag{62}
$$

$$
\sin(2\alpha) + k_2 \sin(2\delta) \cos(2\alpha) - k_2 \cos(2\delta) \sin(2\alpha) = 0
\tag{63}
$$

$$
\tan(2\alpha) = -\frac{k_2 \sin(2\delta)}{1 - k_2 \cos(2\delta)}
\tag{64}
$$

$$
\alpha = \frac{1}{2} \arctan\left( \frac{2 k_1 k_2 \sqrt{1 - k_1^2}}{k_2 (1 - 2 k_1^2) - 1} \right),
\tag{65}
$$

and, similarly, solving for $\beta$ 

$$
\sin(2\delta - 2\beta) + k_2 \sin(2\beta) = 0
\tag{66}
$$

$$
\sin(2\delta) \cos(2\beta) - \cos(2\delta) \sin(2\beta) + k_2 \sin(2\beta) = 0
\tag{67}
$$

$$
\tan(2\beta) = -\frac{\sin(2\delta)}{k_2 - \cos(2\delta)}
\tag{68}
$$

$$
\beta = \frac{1}{2} \arctan\left( \frac{2 k_1 \sqrt{1 - k_1^2}}{1 - k_2 - 2 k_1^2} \right).
\tag{69}
$$

These solutions of $\alpha$ and $\beta$ are unique up to a rotation with periodicity $\pi/2$. We recall that 
$a_{11} = \cos\alpha$, $a_{12} = \sin\alpha$, $a_{21} = \sin\beta$, and $a_{22} = \cos\beta$. The eigenvectors and their 
respective eigenvalues can be easily derived by back-substitution in Eq. 51 and Eq. 52. 

The system in Eq. 57 is therefore a determined system of two equations in two variables. It is thus 
possible to express the off-diagonal terms of the matrix Q as the linear combination of the outer products 
of two orthogonal vectors $e_1$ and $e_2$, defined by the angles $\alpha$ and $\beta$. Therefore, it follows that the 
off-diagonal elements of the matrix Q correspond to a symmetric real-valued rank-two matrix whose 
eigenvectors are the two orthogonal vectors $e_1$ and $e_2$. 

Importantly, the classifiers belonging to the P group lie on a line with angle $\alpha$ relative to the 
eigenvector $e_1$. The classifiers in the cartel lie on a line with angle $\beta$ relative to the eigenvector $e_2$. □ 
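
The construction can be checked numerically on a small example. The sketch below is an illustration with made-up values: `p` holds $2\pi_i - 1$ for honest classifiers, `r` holds the analogous quantity for cartel members relative to their target, and `p_c` is the same quantity for the target relative to the truth. It solves for the angle $\alpha$ from the tangent relation, sets $\beta = \delta - \alpha$, and verifies that the two block vectors are orthogonal and reproduce the three blocks of the rank-two matrix.

```python
import math

# Illustrative values (assumptions, not taken from the study).
p = [0.4, 0.6, 0.2]     # honest classifiers
r = [0.4, 0.4]          # cartel members, relative to the cartel's target
p_c = 0.3               # cartel's target, relative to the truth
u = 1.0                 # u = 1 - b^2 with balanced classes

k1 = p_c
k2 = sum(x * x for x in r) / sum(x * x for x in p)

delta = math.asin(k1)
# tan(2 alpha) = -k2 sin(2 delta) / (1 - k2 cos(2 delta)), then beta = delta - alpha.
alpha = 0.5 * math.atan(-k2 * math.sin(2 * delta) / (1 - k2 * math.cos(2 * delta)))
beta = delta - alpha

a11, a12 = math.cos(alpha), math.sin(alpha)
a21, a22 = math.sin(beta), math.cos(beta)

# Unnormalized block vectors sqrt(lambda_1) e_1 and sqrt(lambda_2) e_2.
e1 = [math.sqrt(u) * a11 * x for x in p] + [math.sqrt(u) * a21 * x for x in r]
e2 = [math.sqrt(u) * a12 * x for x in p] + [math.sqrt(u) * a22 * x for x in r]

s = p + r
n_p = len(p)


def target(i, j):
    """Entry of the rank-two matrix: u*p_i*p_j, u*r_i*r_j or u*p_i*p_c*r_j."""
    cross = (i < n_p) != (j < n_p)   # one index in P, the other in C
    return u * s[i] * s[j] * (p_c if cross else 1.0)


residual = max(abs(e1[i] * e1[j] + e2[i] * e2[j] - target(i, j))
               for i in range(len(s)) for j in range(len(s)))
orthogonality = sum(x * y for x, y in zip(e1, e2))
print(alpha, beta, residual, orthogonality)
```

Computing $\beta$ as $\delta - \alpha$ rather than from its own arctangent avoids having to pick matching branches for the two angles.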



6 Simulations and benchmarks 

The following section describes how we generated the simulated data and how we performed the 
benchmarks. For each component of the simulations we also provide pseudo-code. 

6.1 Simulated data: Ensembles of statistically independent predictions 

We generated ensembles of statistically independent predictions using the previously described random 
detector with fixed balanced accuracy (RDFBA) [24]. A generic RDFBA predictor with pre-determined 
balanced accuracy equal to $\pi$ is denoted by RDFBA($\pi$). RDFBA($\pi$) predictions are used to simulate 
predictions from independent classifiers. RDFBAs are constructed such that their balanced accuracy is 
equal to $\pi$, although the sensitivity $\psi$ and specificity $\eta$ may differ between predictors with the same $\pi$. 

Following the standard machine-learning notation, P is the number of positives, i.e. the number of 
instances whose true class label is +1; N is the number of negatives, i.e. the number of instances whose 
true class label is -1; FP is the number of false positives, i.e. the number of negatives that have been 
mistakenly predicted as positives; FN is the number of false negatives, i.e. the number of positives that 
have been mistakenly predicted as negatives. Thus, an RDFBA(7r) prediction is constructed from the 
ground truth vector y as follows: 

1. The entries of the prediction vector f(x) are initialized with the corresponding entries in the ground 
truth vector y. 

2. A random integer FP is drawn with uniform probability from [0, N], under the constraint that 
$FN = (2 - 2\pi - FP/N) \cdot P$ is an integer. 

3. FP instances in f(x), whose true label is -1, are assigned the wrong class label, +1. 

4. FN instances in f(x), whose true label is +1, are assigned the wrong class label, -1. 

The advantage of using RDFBA predictors is that each prediction satisfies the assumption of 
independence between predictors. 

In our simulations, we used P = N = 300, and $\pi \in [0.3, 0.8]$. 
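
The four steps above can be sketched in Python as follows. This is a hedged illustration rather than the original implementation: the function names are hypothetical, the integrality test uses a small floating-point tolerance, and FN is additionally required to lie in [0, P], which the text leaves implicit.

```python
import random


def rdfba(y, pi, rng):
    """Sketch of an RDFBA(pi) prediction for a +1/-1 ground-truth vector y."""
    pos = [k for k, label in enumerate(y) if label == 1]
    neg = [k for k, label in enumerate(y) if label == -1]
    n_pos, n_neg = len(pos), len(neg)

    # Admissible pairs: FN = (2 - 2*pi - FP/N) * P must be an integer in [0, P].
    admissible = []
    for fp in range(n_neg + 1):
        fn = (2 - 2 * pi - fp / n_neg) * n_pos
        if abs(fn - round(fn)) < 1e-9 and 0 <= round(fn) <= n_pos:
            admissible.append((fp, round(fn)))
    if not admissible:
        raise ValueError("no integer (FP, FN) pair gives this balanced accuracy")

    fp, fn = rng.choice(admissible)    # step 2: uniform over admissible FP values
    f = list(y)                        # step 1: start from the ground truth
    for k in rng.sample(neg, fp):      # step 3: false positives
        f[k] = 1
    for k in rng.sample(pos, fn):      # step 4: false negatives
        f[k] = -1
    return f


def balanced_accuracy(f, y):
    """Mean of sensitivity and specificity of prediction f against labels y."""
    sens = sum(a == b == 1 for a, b in zip(f, y)) / sum(b == 1 for b in y)
    spec = sum(a == b == -1 for a, b in zip(f, y)) / sum(b == -1 for b in y)
    return (sens + spec) / 2


rng = random.Random(0)                 # seeded for reproducibility
y = [1] * 300 + [-1] * 300             # P = N = 300 as in the simulations
f = rdfba(y, 0.7, rng)
print(balanced_accuracy(f, y))
```

Because FP is drawn at random, two RDFBA predictions with the same balanced accuracy generally have different sensitivity and specificity.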

6.2 Simulated data: Ensembles of uncorrelated predictions with a small cartel 
of strongly correlated predictors 

In order to generate datasets of uncorrelated predictions where a small cartel ($r \cdot M$ predictors) was 
present, we first generated an ensemble P of $(1 - r)M$ independent predictions as described above for 
the ground truth vector y. Then, we constructed the cartel's target vector c, i.e. a vector alternative to 
the truth that is supported by the predictions in the cartel. The vector c is constructed as an RDFBA 
prediction with balanced accuracy $\pi_c$. For this vector c we constructed an ensemble C of independent 
predictions, following the procedure described for the statistically independent predictions, with the 
only difference that the balanced accuracies of all members of the cartel relative to the cartel's target were 
set equal to 0.7. The dataset is obtained as the union of the two ensembles of predictions, P and 
C. In our simulations we used $\pi_c = 0.5$, thus obtaining a cartel's target that is orthogonal to the ground 
truth. 

6.3 Real data: Ensembles of predictions from standard machine-learning 
classifiers 

To generate ensembles of predictions from standard machine-learning classifiers on real data, we trained 
the classifiers on partially overlapping training data and collected their predictions obtained on the same 
testing data, which was independent from all the training data. In detail, from each dataset we sampled 
600 instances (or all the instances if fewer than 600 were available), half of which (up to 300) were used for 
testing. Independently for each classifier, we selected a random subset comprising 90% of the instances 
reserved for training and used this subset as a "private" training set. The purpose of this procedure 
was to produce training data that differed slightly between the classifiers while still providing 
a large number of training samples even in the smaller datasets. We 
chose to use at most 600 instances to reduce computational time. To determine the empirical distribution 
of performances of each classifier and of the ensemble approaches discussed in the manuscript, for each 
dataset we repeated this procedure 1000 times. 
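
One run of this sampling scheme can be sketched as index bookkeeping. The function name `private_training_splits` and its parameters are hypothetical, and the rounding of the 90% subset size is an assumption, since the text does not specify it.

```python
import random


def private_training_splits(n_instances, n_classifiers, rng,
                            max_instances=600, train_fraction=0.9):
    """Index sets for one run: a shared test set and one private training set per classifier."""
    # Sample at most 600 instances; the sample comes back in random order.
    sampled = rng.sample(range(n_instances), min(max_instances, n_instances))
    n_test = len(sampled) // 2
    test, train_pool = sampled[:n_test], sampled[n_test:]
    # Each classifier receives its own random 90% of the training pool.
    n_private = round(train_fraction * len(train_pool))
    private = [rng.sample(train_pool, n_private) for _ in range(n_classifiers)]
    return test, private


rng = random.Random(1)
test_idx, private_idx = private_training_splits(4601, 5, rng)
print(len(test_idx), [len(s) for s in private_idx])
```

Repeating the call with fresh random draws would correspond to the repeated runs used to build the empirical performance distributions.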
7 Supplemental Tables 

Table S1: Summary of the datasets from the UCI repository [23]. 

Dataset                                  Instances  Features  Class                         Reference 
---------------------------------------  ---------  --------  ----------------------------  --------- 
Spambase data - SD                       4601       57        spam/not spam                 23 
Yale breast cancer dataset - YBC         650        6         nodal status                  22 
Wisconsin breast cancer dataset - WBC    699        10        benign/malignant              26 
Parkinson data - PD                      197        23        affected/unaffected           27 
MAGIC Gamma Telescope - MGT              19020      11        signal/background             28 
Ionosphere data - ID                     351        34        Good return/Bad return        29 
Mammographic masses - MM                 961        6         disease severity (2 classes)  30 
Abalone data - AD                        4177       8         male/female                   31 
Table S2: Summary of the machine learning classifiers from Weka [32]. 

classifier / meta-learner          Weka class 
---------------------------------  ------------------------------ 
KNN (k=1, odd)                     lazy/IBk 
KNN (k=2, even)                    lazy/IBk 
KNN (k=5)                          lazy/IBk 
k-Star                             lazy/KStar 
DecisionStump                      trees/DecisionStump 
J48                                trees/J48 
REPTree                            trees/REPTree 
JRip                               rules/JRip 
LMT                                trees/LMT 
LWL                                lazy/LWL 
Regularized Logistic regression    functions/Logistic 
Logistic regression                functions/SimpleLogistic 
Sequential Minimal Optimization    functions/SMO 
NaiveBayes                         bayes/NaiveBayes 
M5P                                rules/M5P 
OneR                               rules/OneR 
PART                               rules/PART 
RandomForest (n=10 trees)          trees/RandomForest 
RandomForest (n=20 trees)          trees/RandomForest 
Multilayer Perceptron              functions/MultilayerPerceptron 
Voted Perceptron                   functions/VotedPerceptron 
SGD                                functions/SGD 
Voting                             meta/Vote 
Stacking                           meta/Stacking 
AdaBoost + NaiveBayes              meta/AdaBoostM1 
AdaBoost + Logistic Regression     meta/AdaBoostM1 
AdaBoost + J48                     meta/AdaBoostM1 
Bagging + REPTree                  meta/Bagging 
Bagging + RandomTree               meta/Bagging 
Bagging + RandomForest             meta/Bagging 
LogitBoost + ZeroR                 meta/LogitBoost 
LogitBoost + KNN                   meta/LogitBoost 
LogitBoost + DecisionStump         meta/LogitBoost 



8 Supplemental Figures 

[Figure S4 image: heatmap with color scale $|\sin(\alpha)|$; horizontal axis $k_1$, from -1 to 1] 

Figure S4: The heatmap shows the absolute value of the angle between the truth and the eigenvector $e_1$, 
on which the SML prediction is based. The dark area between the two red lines graphically shows the 
relationship between $k_1$ and $k_2$ such that $|\alpha| < 6^\circ$. The figure shows that SML is robust to cartels: 
when $\alpha \approx 0$, the honest classifiers lie approximately on the eigenvector $e_1$. 
[Figure S5 image: two bar plots, titled "Independent classifiers" and "Cartel (33% of classifiers)"; 
horizontal axis "Inferred rank of the best algorithm", with bins 1 to 9 and >9] 

Figure S5: The largest entry in the leading eigenvector often corresponds to the best classifier in the 
ensemble. In the plots, each bar represents the empirical probability that the entry in the leading 
eigenvector corresponding to the best classifier attained a specific rank. 